The three big questions for evaluating a model are: 1) Is the model fair? Always ask this question, and evaluate it by asking how the data were collected, by whom and for what purpose, and how the data and analysis might affect individuals and communities.
2) How wrong is the model? We can use pp_check() to see whether the central tendency and spread of data simulated from the model are similar to those of the observed data.
3) How accurate are the posterior predictive models? We can use posterior predictive summaries (e.g., MAE, scaled MAE, within-50) to assess accuracy.
Unfairness in data collection - Only collecting data on political preferences from people who work on Capitol Hill.
Unfairness in purpose of data collection - Using a model to determine placement in AP classes for incoming high school students. I'm thinking of tracking, and of how many students and parents don't know when and how a student is placed in a certain track, even though tracking affects postsecondary outcomes; I only learned about my own track, and tracking in general, in a coincidental conversation with an administrator early in high school. More on tracking: https://www.nassp.org/tracking-and-ability-grouping-in-middle-level-and-high-schools/
Unfairness in the impact of analysis on society - Using a model to predict housing prices that overestimates prices and drives up housing costs (think of the Zillow scandal).
Bias baked into the analysis - Collecting data on demographics (race, BMI, gender) to create a model of survivorship on different medications, then assigning treatment to future patients based on demographics. Such a model bakes in racism, sexism, and fatphobia, which themselves affect survivorship.
My perspective is influenced by being a 1.5 generation immigrant and Black/African woman living in the South.
There are many American subcultures of which I don't have a complete or near-complete picture. I also don't have an inside view of how men socialize.
I have more insight into the immigration process than those who've never gone through it, and living in the South gives me a truer sense of Southern cultures than Northeastern media simplifications do.
I would tell the colleague that they were socialized and educated in a society shaped by coloniality, which tries to naturalize various -isms. Science and data analysis are not immune to this hegemony, and it can be argued that data science helps perpetuate it, so they are not neutral. I would urge the colleague to check the three big questions at minimum, plus further questions such as: am I measuring race, or racism? Woman-dependent outcomes, or sexism?
I would respond that unless efforts have been made to check for and remove bias in the model's data (garbage data in, garbage model out), its data structure, and the premise and purpose of what is being modeled, the model will have bias baked in.
I once worked as a research assistant for a global health study exploring how youth understand HIV/AIDS transmission, and we worked with narratives. I was able to provide the research team with additional context about some content in the narratives from Kenyan youth. Someone who grew up in Kenya would probably have given even better context, but even having been raised abroad I was able to offer some insight.
This quote means that models are idealized representations of real phenomena, so they don't capture every nuance. Models that are less wrong than others give useful approximations, yielding summary statistics that may be close to the actual parameters.
The three assumptions of the Normal Bayesian linear regression model are: 1) Conditioned on X, the observed data Y_i on case i is independent of the observed data on any other case j.
2) The typical Y outcome can be written as a linear function of X: mu = B0 + B1X.
3) At any X value, Y varies normally around mu with consistent variability sigma.
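These three assumptions can be sketched by simulating data that satisfies them; the parameter values below are illustrative placeholders, not from any exercise.

```r
# Simulate data satisfying the three assumptions (illustrative parameter values)
set.seed(42)
beta_0 <- 2; beta_1 <- 0.5; sigma <- 1      # hypothetical parameters
x  <- runif(100, 0, 10)
mu <- beta_0 + beta_1 * x                   # assumption 2: mu is a linear function of X
y  <- rnorm(100, mean = mu, sd = sigma)     # assumption 3: Y normal around mu, constant sigma
                                            # assumption 1: rnorm draws are independent
plot(x, y)
abline(beta_0, beta_1)                      # the underlying line the data varies around
```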
#loading libraries for exercises
library(bayesrules)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(bayesplot)
## This is bayesplot version 1.8.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
## * Does _not_ affect other ggplot2 plots
## * See ?bayesplot_theme_set for details on theme setting
library(rstanarm)
## Loading required package: Rcpp
## This is rstanarm version 2.21.1
## - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
## - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
## - For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores())
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.21.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
##
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
##
## extract
library(tidybayes)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(broom.mixed)
See below.
#creating vectors
x <- c(12,10,4,8,6)
y <- c(20,17,4,11,9)
line_data <- data.frame(x,y)
line_data #looking at data.frame to get a sense of the tendency and spread
## x y
## 1 12 20
## 2 10 17
## 3 4 4
## 4 8 11
## 5 6 9
plot(x,y)
#create simple linear regression with the given data
lm(y ~ x, data = line_data) #slope is the 2nd coefficient; this is just one model of these points
##
## Call:
## lm(formula = y ~ x, data = line_data)
##
## Coefficients:
## (Intercept) x
## -3.8 2.0
line_model <- stan_glm(y ~ x, data = line_data, #create simulation with given parameter set
family = gaussian,
prior_intercept = normal(-1.8),
prior = normal(2.1, .8),
prior_aux = exponential(1.25), #arrived at 1.25 with 1/.8
chains = 4, iter = 5000*2, seed = 80)
beta_0 <- -1.8 #these values given as first parameter set
beta_1 <- 2.1
sigma <- .8
set.seed(678)
y_simulation <- line_data |>
mutate(mu = beta_0 + beta_1 * x,
simulated_y = rnorm(5, mean = mu, sd = sigma)) |> #we have 5 data points
select(x, y, simulated_y)
head(y_simulation, 5) #Compare data to simulation. Part b) The predictions for Y1:Y5 are shown in the column labeled simulated_y.
## x y simulated_y
## 1 12 20 22.781211
## 2 10 17 19.946564
## 3 4 4 6.973068
## 4 8 11 14.132278
## 5 6 9 9.075090
Alternatively, I could have done this more simply with vectors, which produces the estimates Steve showed in class. These estimates are similar to the ones above; the ones above are within 1 or 2 sigma of the estimates below.
y_estimates <- beta_0 + beta_1*x
y_estimates #y estimates with N~(-1.8 + 2.1x_i, 0.8)
## [1] 23.4 19.2 6.6 15.0 10.8
#comparing data to simulated y's visually
ggplot(y_simulation, aes(x = simulated_y)) +
geom_density(color = "red") +
geom_density(aes(x = y), color = "darkblue")
#simulation and data follow similar rises and falls, so I feel comfortable generating more values. For the majority of values the simulation is an overestimate.
# Examine 80 simulated samples
pp_check(line_model, nreps = 80) +
xlab("lines")
#This plot shows that most of the lines capture the range and central tendency of our limited data set.
The goal of a posterior predictive check is to see whether the model can simulate data similar to the data used to build it. The posterior predictive check gives insight into whether the assumptions hold that the typical Y outcome can be written as a linear function of X and that, at any X value, Y varies normally around mu with consistent variability sigma.
Posterior predictive checks can be analyzed by comparing the simulated datasets to the observed dataset and looking for similarities in the spread and typical values of the simulated and observed data. If a model has a significantly different mean, range, or shape than the observed data, we know we have a bad model.
The median absolute error (MAE) tells us the typical difference between the observed Y_i values and their posterior predictive means Y_i'. It helps us judge the quality of the model.
The scaled median absolute error measures the typical number of standard deviations by which the observed Y_i differ from the posterior predictive means Y_i'. It can be an improvement over the unscaled MAE because standardizing the differences eases comparison between models, including models trained and tested on different portions of the data.
The within-50 statistic tells us the proportion of observed Y_i that fall within their 50% posterior prediction intervals. If this number is very low, we know something is off about the model.
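As a rough sketch of how these three summaries are computed, using made-up observed values and posterior predictive draws (not from any exercise data; prediction_summary() in bayesrules does this for you):

```r
# Toy illustration of mae, mae_scaled, and within_50 (invented data)
set.seed(84735)
y_obs <- c(10, 12, 9, 15, 11)                        # pretend observed outcomes
# 1000 posterior predictive draws per case (rows = draws, cols = cases)
y_rep <- sapply(y_obs, function(m) rnorm(1000, mean = m + 0.5, sd = 2))
pred_median <- apply(y_rep, 2, median)
pred_sd     <- apply(y_rep, 2, sd)
mae        <- median(abs(y_obs - pred_median))            # typical raw error
mae_scaled <- median(abs(y_obs - pred_median) / pred_sd)  # same, in sd units
lower <- apply(y_rep, 2, quantile, 0.25)                  # 50% prediction interval
upper <- apply(y_rep, 2, quantile, 0.75)
within_50 <- mean(y_obs >= lower & y_obs <= upper)        # proportion inside interval
```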
In pp_check(), the darker density is the actual observed data. The light blue densities represent simulated data from the model.
If the model fits well, the light blue densities will have a similar central tendency (or tendencies) and spread to the dark blue density. A well-fitting model produces a plot like this because it reflects the structure of the observed data and produces predictions mostly within 2 or 3 standard deviations of the observed values.
If a model fits poorly, it may have a different shape, central tendency, or range (much larger or smaller) than the observed data.
When you use the same data to train and test a model, the model has been optimized to capture features of the very data it was built with, so it produces an overly optimistic estimate of predictive accuracy.
My questions are how to split data into folds in R, and in what sociological context we would use leave-one-out cross-validation.
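On the first question, one minimal way to assign folds by hand in base R looks like the sketch below (prediction_summary_cv() handles this internally; the fold count, seed, and sample size here are arbitrary):

```r
# Randomly assign each of n observations to one of k folds
set.seed(84735)
n <- 572; k <- 10
folds <- sample(rep(1:k, length.out = n))   # shuffled, near-equal fold labels
table(folds)                                # check fold sizes (57 or 58 each)
test_rows  <- which(folds == 1)             # hold out fold 1 for testing
train_rows <- which(folds != 1)             # train on the remaining 9 folds
```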
Load data needed for exercises.
library(bayesrules)
data("coffee_ratings")
coffee_ratings <- coffee_ratings |>
select(farm_name, total_cup_points, aroma, aftertaste)
view(coffee_ratings)
head(coffee_ratings, 15)
## # A tibble: 15 × 4
## farm_name total_cup_points aroma aftertaste
## <fct> <dbl> <dbl> <dbl>
## 1 "metad plc" 90.6 8.67 8.67
## 2 "metad plc" 89.9 8.75 8.5
## 3 "san marcos barrancas \"san cristobal cuch" 89.8 8.42 8.42
## 4 "yidnekachew dabessa coffee plantation" 89 8.17 8.42
## 5 "metad plc" 88.8 8.25 8.25
## 6 <NA> 88.8 8.58 8.42
## 7 <NA> 88.8 8.42 8.33
## 8 "aolme" 88.7 8.25 8.5
## 9 "aolme" 88.4 8.67 8.58
## 10 "tulla coffee farm" 88.2 8.08 8.5
## 11 "fahem coffee plantation" 88.1 8.17 8.25
## 12 "el filo" 87.9 8.25 8.17
## 13 "los cedros" 87.9 8.08 8.33
## 14 "arianna farms" 87.9 8.33 8.08
## 15 "aolme" 87.8 8.25 8.5
Modeling ratings by aroma likely violates the independence assumption (conditioned on X, the observed data Y_i on case i is independent of the observed data on any other case j), because with X = aroma and Y = total_cup_points, a reviewer is likely comparing cup i to cup j to grade aroma. Similarly, with aftertaste as X and total_cup_points as Y, a taster may evaluate the aftertaste of cup i relative to cup j. In other words, if you've never tried anything other than Folgers or my favorite, Nescafe Clasico, you won't know whether it has better or worse aroma or aftertaste than another brand.
set.seed(84735)
new_coffee <- coffee_ratings |>
group_by(farm_name) |> #telling R how to group data for new set
sample_n(1) |> #telling R to take 1 sample from each group
ungroup()
dim(new_coffee) #new_coffee has 572 observations with 4 variables
## [1] 572 4
lm(total_cup_points ~ aroma, data = new_coffee) #one estimate of parameters for a well fitting line; we see slope estimate (6.163) is positive
##
## Call:
## lm(formula = total_cup_points ~ aroma, data = new_coffee)
##
## Coefficients:
## (Intercept) aroma
## 35.397 6.163
ggplot(new_coffee, aes(y = total_cup_points, x = aroma)) + #plotting aroma on x axis and points on y axis
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
The relationship between aroma and coffee rating is positive. As aroma grade increases total_cup_points increases. One model indicates that for each unit increase in aroma, cup points increase by about 6 units.
coffee_model <- stan_glm(total_cup_points ~ aroma, data = new_coffee,
family = gaussian,
prior_intercept = normal(75, 10), #used range rule for sd and 75 was given
prior = normal(7.5, .2), #7.5 seems like a reasonable mu for B_1 based on scanning the data
prior_aux = exponential(0.1), #found lambda by 1/10
chains = 4, iter = 5000*2, seed = 84735)
#Store 4 chains for each parameter in 1 data frame as we might need this later.
coffee_model_df <- as.data.frame(coffee_model)
#in this step I have a numerical posterior summary for aroma
tidy(coffee_model, effects = c("fixed", "aux"),
conf.int = TRUE, conf.level = 0.8)
## # A tibble: 4 × 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 29.0 1.20 27.4 30.5
## 2 aroma 7.01 0.158 6.81 7.21
## 3 sigma 1.98 0.0590 1.90 2.05
## 4 mean_PPD 82.1 0.117 81.9 82.2
# 50 simulated model lines to provide visual summary of B_1 coefficient aroma
new_coffee |>
add_fitted_draws(coffee_model, n = 50) |>
ggplot(aes(x = aroma, y = total_cup_points)) +
geom_line(aes(y = .value, group = .draw), alpha = 0.7, color = "grey") +
geom_point(data = new_coffee, size = 0.05, color = "blue")
## Warning: `fitted_draws` and `add_fitted_draws` are deprecated as their names were confusing.
## Use [add_]epred_draws() to get the expectation of the posterior predictive.
## Use [add_]linpred_draws() to get the distribution of the linear predictor.
## For example, you used [add_]fitted_draws(..., scale = "response"), which
## means you most likely want [add_]epred_draws(...).
#plot shows 50 posterior plausible sets of B_0 and B_1 (aroma). Notice that aroma is less plausible below 7.0.
The posterior median relationship is 28.98 + 7.01X, meaning that for every one-unit increase in aroma, we expect total_cup_points to increase by about 7.01.
Yes, I do, based on my plots in parts a and c showing a positive relationship between aroma and cup score. The posterior summary for aroma shows that the 80% credible interval (6.81, 7.21) lies entirely above zero.
#finding first set of the posterior plausible parameter sets (B_0, B_1, sigma)
first_set <- head(coffee_model_df, 1)
first_set
## (Intercept) aroma sigma
## 1 27.27295 7.241494 1.909072
#from observed data on each of the 572 observations in the new_coffee data we simulate a total_cup_points outcome from the normal data model tuned to this first parameter set
beta_0 <- first_set$`(Intercept)`#remember that beta_0 is intercept
beta_1 <- first_set$aroma #x values
sigma <- first_set$sigma
set.seed(84735)
one_simulation <- new_coffee |>
mutate(mu = beta_0 + beta_1 * aroma, #this is linear equation with data from the first set
simulated_points = rnorm(572, mean = mu, sd = sigma)) |> #remember 572 is sample size
select(aroma, total_cup_points, simulated_points)
head(one_simulation, 3)
## # A tibble: 3 × 3
## aroma total_cup_points simulated_points
## <dbl> <dbl> <dbl>
## 1 7.67 84 84.1
## 2 7.33 76.2 80.1
## 3 6.75 67.9 77.6
Density plot.
ggplot(one_simulation, aes(x = simulated_points)) +
geom_density(color = "lightblue") +
geom_density(aes(x = total_cup_points), color = "darkblue")
#examine 100 of the 20000 simulated points
pp_check(coffee_model, nreps = 100) +
xlab("points")
new_coffee |>
filter(farm_name == "-") |> #the first batch only had a hyphen for a name but this filter still worked. I could have also just looked at the data to see the aroma and points since these values are first in the list
select(aroma, total_cup_points)
## # A tibble: 1 × 2
## aroma total_cup_points
## <dbl> <dbl>
## 1 7.67 84
#simulate posterior predictive model for first farm.
set.seed(84735)
predict_firstfarm <- coffee_model_df |>
mutate(mu = `(Intercept)` + aroma*7.67,
point_new = rnorm(20000, mean = mu, sd = sigma))
#plot this model
ggplot(predict_firstfarm, aes(x = point_new)) +
geom_density() +
geom_vline(xintercept = 84, color = "lightgreen") #added this line to show where score for first farm in observed data = 84
#raw predictive error
predict_firstfarm |>
summarize(mean = mean(point_new), error = 84 - mean(point_new))
## mean error
## 1 82.74027 1.259727
The model under-predicted the points by about 1.26.
#standardized posterior error
predict_firstfarm |>
summarize(sd = sd(point_new), error = 84 - mean(point_new),
error_scaled = error / sd(point_new))
## sd error error_scaled
## 1 1.97503 1.259727 0.6378268
The score of 84 we observed in the new_coffee data is 0.64 standard deviations above the posterior predictive mean.
#in this case we'll use posterior_predict
coffpredictions <- posterior_predict(coffee_model, newdata = new_coffee)
dim(coffpredictions)
## [1] 20000 572
ppc_intervals(new_coffee$total_cup_points, yrep = coffpredictions, x = new_coffee$aroma, prob = 0.5, prob_outer = 0.95, size = 0.8) #adjusted size to make blue lines easier to see
For each of the 572 total_cup_points, the 95% prediction interval is displayed as the narrow, long blue bar and the 50% prediction interval as the wider, short blue bar. The posterior predictive median is the light blue dot. What we are seeing is the center and spread of each posterior predictive model, which tells us how compatible each model is with the observed points (dark blue dots). Visually, most of the observed points fall within the 95% intervals and roughly two-thirds within the 50% intervals, which gives me confidence in the accuracy of the model.
set.seed(84735)
prediction_summary(coffee_model, data = new_coffee)
## mae mae_scaled within_50 within_95
## 1 0.9414586 0.4794058 0.6520979 0.958042
65.2% of batches have ratings that fall within their 50% posterior prediction intervals.
set.seed(84735)
cv_procedurecoff <- prediction_summary_cv(
model = coffee_model, data = new_coffee, k = 10) #used 10 because of Goldilocks challenge
cv_procedurecoff
## $folds
## fold mae mae_scaled within_50 within_95
## 1 1 0.9835034 0.4906481 0.6034483 0.9310345
## 2 2 1.0054881 0.5200194 0.5263158 0.9298246
## 3 3 0.9521365 0.4646638 0.7192982 0.9649123
## 4 4 1.0410444 0.5109289 0.6140351 0.9824561
## 5 5 0.9363774 0.5297648 0.6315789 0.9473684
## 6 6 0.9295120 0.4599348 0.7368421 0.9824561
## 7 7 0.7660163 0.3831139 0.6842105 0.9649123
## 8 8 1.0092968 0.5084666 0.6140351 0.9298246
## 9 9 0.8018329 0.3974392 0.6842105 0.9824561
## 10 10 0.6138494 0.3039137 0.6379310 0.9655172
##
## $cv
## mae mae_scaled within_50 within_95
## 1 0.9039057 0.4568893 0.6451906 0.9580762
The median absolute error (mae) tells us that the typical difference between the observed points and their posterior predictive means is 0.904. The scaled MAE tells us that the observed points typically fall 0.457 standard deviations from their posterior predictive means. 64.5% of the observed points fall within their 50% posterior prediction intervals, and 95.8% fall within their 95% posterior prediction intervals.
Looking at the MAE across the folds, it ranged from as low as 0.614 to as high as 1.04. The variation across the four cross-validated metrics shows how a training model performs better on some sets than others; the within_50 metric was as low as 0.526 for one fold. The metrics averaged across these 10 folds seem to represent the trend of the folds well.
lm(total_cup_points ~ aftertaste, data = new_coffee) #one estimate of parameters for a well fitting line; we see slope estimate (6.623) is positive
##
## Call:
## lm(formula = total_cup_points ~ aftertaste, data = new_coffee)
##
## Coefficients:
## (Intercept) aftertaste
## 33.121 6.623
ggplot(new_coffee, aes(y = total_cup_points, x = aftertaste)) + #plotting aftertaste on x axis and points on y axis
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
aftertaste_model <- stan_glm(total_cup_points ~ aftertaste, data = new_coffee,
family = gaussian,
prior_intercept = normal(75, 10), #kept same prior for rating from earlier exercise
prior = normal(7.4, .3), #7.4 seems like a reasonable mu for B_1 based on the second visualization for this problem; .3 is an estimate from the range rule
prior_aux = exponential(0.1), #found lambda by 1/10
chains = 4, iter = 5000*2, seed = 84735)
pp_check(aftertaste_model, nreps = 100) +
xlab("total_cup_points")
#finding 10-fold cross validation of the posterior predictive quality
set.seed(84735)
cv_procedureaf <- prediction_summary_cv(
model = aftertaste_model, data = new_coffee, k = 10)
cv_procedureaf
## $folds
## fold mae mae_scaled within_50 within_95
## 1 1 0.5291000 0.3272169 0.7758621 0.9827586
## 2 2 0.7338549 0.4742131 0.7017544 0.9473684
## 3 3 0.7298899 0.4483136 0.7192982 1.0000000
## 4 4 0.5269603 0.3262044 0.8596491 0.9649123
## 5 5 0.5482930 0.3938378 0.7017544 0.9473684
## 6 6 0.6901362 0.4327767 0.6315789 0.9824561
## 7 7 0.8247415 0.5229725 0.7017544 0.9298246
## 8 8 0.7082116 0.4511447 0.6666667 0.9649123
## 9 9 0.9796816 0.6103597 0.5789474 0.9824561
## 10 10 0.8477898 0.5358133 0.7241379 0.9655172
##
## $cv
## mae mae_scaled within_50 within_95
## 1 0.7118659 0.4522853 0.7061404 0.9667574
First I’ll visualize the data to inform some priors.
lm(maxtemp ~ mintemp, data = weather_perth) #the second value is one possible slope
##
## Call:
## lm(formula = maxtemp ~ mintemp, data = weather_perth)
##
## Coefficients:
## (Intercept) mintemp
## 14.4681 0.8355
ggplot(weather_perth, aes(y = maxtemp, x = mintemp)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Next I'll fit the model with stan_glm().
temp_model <- stan_glm(maxtemp ~ mintemp, data = weather_perth,
family = gaussian,
prior_intercept = normal(25, 11), #points seem to cluster around 25 on maxtemp and used estimate of range rule for sd
prior = normal(15, 7.5), #points seem to cluster around 15 on mintemp and used estimate of range rule for sd
prior_aux = exponential(0.09), #found lambda by 1/11
chains = 4, iter = 5000*2, seed = 84735)
#Store 4 chains for each parameter in 1 data frame
temp_model_df <- as.data.frame(temp_model)
Use tidy() for a posterior summary of mintemp.
#in this step I have a numerical posterior summary for mintemp
tidy(temp_model, effects = c("fixed", "aux"),
conf.int = TRUE, conf.level = 0.8)
## # A tibble: 4 × 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 14.5 0.380 14.0 14.9
## 2 mintemp 0.836 0.0264 0.802 0.870
## 3 sigma 4.30 0.0964 4.18 4.42
## 4 mean_PPD 25.6 0.192 25.3 25.8
Note that the credible intervals are above 0.
The posterior median relationship is 14.47 + 0.84X, meaning that for every 1 degree increase in mintemp, we can expect maxtemp to increase by about 0.84 degrees.
pp_check(temp_model, nreps = 50) +
xlab("maxtemp")
Notice the model doesn't capture the bimodality of maxtemp, which could reflect seasonal changes or something else (maybe wildfires?). It captures the range, though its range is slightly larger than the observed one.
predictions <- posterior_predict(temp_model, newdata = weather_perth)
dim(predictions)
## [1] 20000 1000
ppc_intervals(weather_perth$maxtemp, yrep = predictions, x = weather_perth$mintemp, prob = 0.5, prob_outer = 0.95)
prediction_summary(temp_model, data = weather_perth)
## mae mae_scaled within_50 within_95
## 1 3.23352 0.7519272 0.458 0.969
Note that only 45.8% of observed maxtemps fall within their 50% posterior prediction intervals.
cv_proceduret <- prediction_summary_cv(
model = temp_model, data = weather_perth, k = 10)
cv_proceduret
## $folds
## fold mae mae_scaled within_50 within_95
## 1 1 3.237666 0.7491671 0.44 1.00
## 2 2 2.667141 0.6121590 0.55 0.97
## 3 3 3.200698 0.7478611 0.47 0.96
## 4 4 3.657437 0.8669770 0.36 0.96
## 5 5 3.389756 0.7879079 0.44 0.95
## 6 6 3.342260 0.7779194 0.41 0.94
## 7 7 3.103401 0.7211185 0.48 0.99
## 8 8 3.062961 0.7127143 0.48 0.98
## 9 9 3.182260 0.7374859 0.46 0.97
## 10 10 3.146443 0.7258124 0.48 0.96
##
## $cv
## mae mae_scaled within_50 within_95
## 1 3.199002 0.7439123 0.457 0.968
This model has useful predictive accuracy and captures the generally positive relationship between mintemp and maxtemp. However, we know the model could be improved, given the below-50% within_50 metrics and its failure to capture the bimodality of the sample data.
First I’ll visualize the data to inform some priors.
lm(rides ~ humidity, data = bikes) #the second value is one possible slope
##
## Call:
## lm(formula = rides ~ humidity, data = bikes)
##
## Coefficients:
## (Intercept) humidity
## 3933.722 -7.117
ggplot(bikes, aes(y = rides, x = humidity)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Notice that there's a weak negative association between humidity and rides: as humidity increases, rides decrease slightly.
Next I'll fit the model with stan_glm().
humid_model <- stan_glm(rides ~ humidity, data = bikes,
family = gaussian,
prior_intercept = normal(5000, 1000), #using priors from ch. 9 about the average and sd of rides on an average temp day
prior = normal(63, 15), #most of the humidity values are between 50 and 75 so I chose 63 as a midpoint; used range rule approx. for sd
prior_aux = exponential(0.0008), #used prior from ch. 9
chains = 4, iter = 5000*2, seed = 84735)
#Store 4 chains for each parameter in 1 data frame
humid_model_df <- as.data.frame(humid_model)
# numerical posterior summary for humidity
tidy(humid_model, effects = c("fixed", "aux"),
conf.int = TRUE, conf.level = 0.8)
## # A tibble: 4 × 5
## term estimate std.error conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3559. 295. 3190. 3933.
## 2 humidity -1.06 4.50 -6.75 4.60
## 3 sigma 1576. 50.1 1513. 1643.
## 4 mean_PPD 3492. 101. 3362. 3620.
The posterior median relationship between rides and humidity is 3558.56 - 1.06X, meaning that for each unit increase in humidity, rides decrease by about 1.06. Note, though, that the 80% credible interval for the humidity coefficient (-6.75, 4.60) straddles zero, so this association is quite uncertain.
To evaluate my model I will do a pp_check, ppc_intervals, and cross-validation.
#first pp_check
pp_check(humid_model, nreps = 50) +
xlab("rides")
#ppc_intervals
humidpredictions <- posterior_predict(humid_model, newdata = bikes)
dim(humidpredictions)
## [1] 20000 500
ppc_intervals(bikes$rides, yrep = humidpredictions, x = bikes$humidity, prob = 0.5, prob_outer = 0.95)
prediction_summary(humid_model, data = bikes)
## mae mae_scaled within_50 within_95
## 1 1189.441 0.750041 0.462 0.966
#cross-validation
cv_procedureh <- prediction_summary_cv(
model = humid_model, data = bikes, k = 10)
cv_procedureh
## $folds
## fold mae mae_scaled within_50 within_95
## 1 1 908.4528 0.5703438 0.58 0.96
## 2 2 1159.7020 0.7419827 0.48 0.98
## 3 3 1233.6827 0.7781523 0.44 1.00
## 4 4 1280.5089 0.8075703 0.40 0.98
## 5 5 1242.0071 0.7886023 0.46 0.98
## 6 6 776.6916 0.4883848 0.60 0.94
## 7 7 1077.5036 0.6757442 0.50 0.98
## 8 8 1711.2308 1.0873563 0.34 0.98
## 9 9 1396.8398 0.8888582 0.44 0.90
## 10 10 1373.2332 0.8769758 0.38 0.96
##
## $cv
## mae mae_scaled within_50 within_95
## 1 1215.985 0.7703971 0.462 0.966
The pp_check() shows the model captures the range and central tendency fairly well, but it misses the bimodality of the observed rides. Per prediction_summary(), 46.2% of observed rides fall within their 50% posterior prediction intervals.
The cross-validation shows similar within_50 metrics, and the scaled MAE shows that observed rides typically fall about 0.77 standard deviations from their posterior predictive means. Overall this model is okay at capturing the weak negative relationship between humidity and rides. An improved model would capture the bimodality of the observed data. We should keep in mind that much of the variability in this model comes from the widely dispersed observed data; the wide ppc_intervals bands reflect this.